SNOW-2105991: Use pre-computed row counts more aggressively #3358
Merged
sfc-gh-joshi merged 6 commits into main on May 21, 2025
Force-pushed from 07753f9 to 7197420
sfc-gh-lmukhopadhyay (Contributor) approved these changes on May 20, 2025 and left a comment:
This LGTM! I believe this might eventually affect the perf regression stats, if the query benchmark hasn't been removed already?
Reply from the PR author (Contributor):
Yes, though in most cases the value would go down.
Force-pushed from 7197420 to 0b47722
sfc-gh-mvashishtha approved these changes on May 20, 2025
Co-authored-by: Mahesh Vashishtha <mahesh.vashishtha@snowflake.com>
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-2105991
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
SNOW-1900040 (#3144) added row count estimation that propagates across pandas dataframe operations, including storing precise row count information for frames constructed from native pandas/Python objects and frames created by `read_snowflake`. This PR extends that earlier work by propagating the precise row count value across certain ordered dataframe operations, namely `select`, `union_all`, and `sort`, which all have predictable effects on the size of the resulting frame. This PR also changes internal methods that retrieve the frame's row count (used in many operations for bounds checking or other validation) to use this cached row count value rather than issuing a query.

In short, retrieving the length of a dataframe created from native pandas/Python objects or directly by `read_snowflake` will no longer issue an extra query. This reduces query counts across large parts of the test suite. Some notable affected APIs include `repr`, `crosstab`, `insert`, `loc`, `iloc`, `iterrows`, and `groupby` operations with `by` specified as a native list object.

I did not take benchmarks for most of these operations; the removal of a query should strictly represent an improvement. I did benchmark changes to the `repr` operation, as some more work was needed to take advantage of the cached row count for that API (following the approach taken in SNOW-1705797/#2760). At some point between 4/15 and 4/22 (dashboard metrics link; the daily benchmark runner experienced some downtime during this period due to modin versioning issues, so the exact date is lost), `repr` for very large dataframes began taking almost twice as long as before. I did not investigate the root cause, but this work remedies some of the performance impact.

Performance for `repr(df)` on this PR (b204452) vs. main (4b5feb)
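As a rough illustration of the row count propagation described in this PR (separate from the benchmark numbers), here is a minimal sketch. It is not the Snowpark pandas implementation: `LazyFrame`, `from_local`, `num_rows`, `filter`, and the `queries_issued` counter are hypothetical names invented for the example, and the "query" is simulated with an in-memory list. The assumption is that `select` (a projection) and `sort` keep the row count unchanged, and `union_all` adds the two counts.

```python
from dataclasses import dataclass
from typing import Callable, ClassVar, List, Optional


@dataclass
class LazyFrame:
    """Hypothetical stand-in for a lazily evaluated frame (not a real API)."""

    rows: List[dict]                 # stands in for data that lives remotely
    row_count: Optional[int] = None  # cached exact row count, if known
    queries_issued: ClassVar[int] = 0  # simulated COUNT(*) round trips

    @classmethod
    def from_local(cls, rows: List[dict]) -> "LazyFrame":
        # Built from local objects, so the exact size is known up front.
        return cls(list(rows), row_count=len(rows))

    def select(self, *cols: str) -> "LazyFrame":
        # A projection keeps every row, so the cached count carries over.
        projected = [{c: r[c] for c in cols} for r in self.rows]
        return LazyFrame(projected, self.row_count)

    def sort(self, key: str) -> "LazyFrame":
        # Sorting only reorders rows; the count is unchanged.
        ordered = sorted(self.rows, key=lambda r: r[key])
        return LazyFrame(ordered, self.row_count)

    def union_all(self, other: "LazyFrame") -> "LazyFrame":
        # UNION ALL keeps duplicates, so the counts add when both are known.
        if self.row_count is None or other.row_count is None:
            count = None
        else:
            count = self.row_count + other.row_count
        return LazyFrame(self.rows + other.rows, count)

    def filter(self, predicate: Callable[[dict], bool]) -> "LazyFrame":
        # Contrast case: the result size depends on the data, so no cache.
        return LazyFrame([r for r in self.rows if predicate(r)], None)

    def num_rows(self) -> int:
        # Use the cached count when available; otherwise "issue a query".
        if self.row_count is None:
            LazyFrame.queries_issued += 1
            self.row_count = len(self.rows)
        return self.row_count


a = LazyFrame.from_local([{"x": 2, "y": "b"}, {"x": 1, "y": "a"}])
b = LazyFrame.from_local([{"x": 3, "y": "c"}])
combined = a.union_all(b).sort("x").select("x")
combined.num_rows()  # answered from the cached count, no simulated query
combined.filter(lambda r: r["x"] > 1).num_rows()  # falls back to a query
```

In this sketch, only the `filter` result needs the simulated query, because its size cannot be derived from the inputs' cached counts.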